The data was too large to upload to GitHub, so instead we're providing a link to our data:
https://drive.google.com/open?id=1U8RZsuU_aNml-WSTOBIxGy_92YEj4ebK
1. loan.csv ==> original data from LendingClub
2. loan.RData ==> the file above converted to .RData, since it's much faster to load and is what the R Markdown actually uses
3. models.RData ==> our saved model training results, since training the models was very time consuming (we've set up our R Markdown to read these results instead of re-training)
4. LCDataDictionary.xlsx ==> data dictionary for this data set
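For reference, the two .RData files can be loaded directly once downloaded (a sketch; it assumes the files sit in the working directory, and the object names follow the save calls later in this document):

```r
load('loan.RData')    # loads loan.dat, the converted LendingClub data
load('models.RData')  # loads fit.ridge, fit.lasso, and fit.enet
```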
rm(list = ls())
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(tidyverse)
library(funModeling)
library(caret)
library(VIM)
library(mice)
library(ggcorrplot)
library(plotly)
library(pROC)
library(lubridate)
library(glmnet)
library(broom)
library(MASS)
library(usmap)
library(RColorBrewer)
set.seed(1)
set_dir <- '/Users/shuheimiyasaka/Google Drive/BST 260 Final Project/For Submission'
As graduate students, debt is all around us. Fortunately, we are also data scientists, and data is all around us, too. Most of the major milestones in life - such as graduating from college or a graduate program, or buying a house - require taking on debt. A report from Time’s Money Magazine based on the Federal Reserve’s Survey of Consumer Finances found that on average, Americans under 35 owe $67,400. For middle-aged Americans, the average is even higher, ranging from $108,300 for 55-64 year olds to $134,600 for 45-54 year olds (http://time.com/money/5233033/average-debt-every-age/).
We are interested in digging deeper into how loan applications are evaluated and, from the applicant's perspective, in better understanding the factors a lending agent might consider. Alternatively, our analysis could be useful from the lender's perspective in identifying factors that predict loan defaults.
The goal for our final project was to build a prediction model using the LendingClub loans data set. We wanted to build a model that could predict a dichotomized outcome of loan status (which will be explained in more detail below) using the variables given in the data set. In this page, we will walk you through our process for building our final prediction model.
Our data is available via Kaggle at https://www.kaggle.com/wendykan/lending-club-loan-data. This data set differs from most other financial institutions' data because of the lending method LendingClub uses. Headquartered in San Francisco, LendingClub connects borrowers applying for personal loans, auto refinancing, business loans, and elective medical procedures with investors. LendingClub reports that it is America's largest online marketplace, and emphasizes the digital aspects of its model (https://www.lendingclub.com/). LendingClub also services the loans, and therefore maintains data on the loans' statuses as well as information from the loan applications.
setwd(set_dir)
load('./loan.RData')
#loan.dat <- read.csv('loan.csv', header = TRUE)
#save(loan.dat, file = "./loan.RData")
The data set has 887,379 records with 74 variables.
dim(loan.dat)
## [1] 887379 74
names(loan.dat)
## [1] "id" "member_id"
## [3] "loan_amnt" "funded_amnt"
## [5] "funded_amnt_inv" "term"
## [7] "int_rate" "installment"
## [9] "grade" "sub_grade"
## [11] "emp_title" "emp_length"
## [13] "home_ownership" "annual_inc"
## [15] "verification_status" "issue_d"
## [17] "loan_status" "pymnt_plan"
## [19] "url" "desc"
## [21] "purpose" "title"
## [23] "zip_code" "addr_state"
## [25] "dti" "delinq_2yrs"
## [27] "earliest_cr_line" "inq_last_6mths"
## [29] "mths_since_last_delinq" "mths_since_last_record"
## [31] "open_acc" "pub_rec"
## [33] "revol_bal" "revol_util"
## [35] "total_acc" "initial_list_status"
## [37] "out_prncp" "out_prncp_inv"
## [39] "total_pymnt" "total_pymnt_inv"
## [41] "total_rec_prncp" "total_rec_int"
## [43] "total_rec_late_fee" "recoveries"
## [45] "collection_recovery_fee" "last_pymnt_d"
## [47] "last_pymnt_amnt" "next_pymnt_d"
## [49] "last_credit_pull_d" "collections_12_mths_ex_med"
## [51] "mths_since_last_major_derog" "policy_code"
## [53] "application_type" "annual_inc_joint"
## [55] "dti_joint" "verification_status_joint"
## [57] "acc_now_delinq" "tot_coll_amt"
## [59] "tot_cur_bal" "open_acc_6m"
## [61] "open_il_6m" "open_il_12m"
## [63] "open_il_24m" "mths_since_rcnt_il"
## [65] "total_bal_il" "il_util"
## [67] "open_rv_12m" "open_rv_24m"
## [69] "max_bal_bc" "all_util"
## [71] "total_rev_hi_lim" "inq_fi"
## [73] "total_cu_tl" "inq_last_12m"
meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
meta_loans[order(-meta_loans$p_na),]
As part of data exploration, we examined the percentage of zeros, missing records, and unique values per variable, as shown above. From the table, we notice a number of variables with a significant amount of missing data.
Based on examining the data set and reading the data dictionary, we decided to immediately rule out the following variables from our model: id, member_id, url, and desc.
cols.2.remove <- c('id', 'member_id', 'url', 'desc')
We decided to exclude variables with more than 10% missing data (19 variables).
missing.data.col <- meta_loans$variable[meta_loans$p_na > 10.]
missing.data.col
## [1] "mths_since_last_delinq" "mths_since_last_record"
## [3] "mths_since_last_major_derog" "annual_inc_joint"
## [5] "dti_joint" "open_acc_6m"
## [7] "open_il_6m" "open_il_12m"
## [9] "open_il_24m" "mths_since_rcnt_il"
## [11] "total_bal_il" "il_util"
## [13] "open_rv_12m" "open_rv_24m"
## [15] "max_bal_bc" "all_util"
## [17] "inq_fi" "total_cu_tl"
## [19] "inq_last_12m"
length(missing.data.col)
## [1] 19
cols.2.remove <- c(cols.2.remove, missing.data.col)
meta_loans[order(meta_loans$unique),]
We also decided to remove policy_code since it only has one unique value.
cols.2.remove <- c(cols.2.remove, 'policy_code')
At this point, we had 50 potential covariates:
cols.2.keep <- !(colnames(loan.dat) %in% cols.2.remove)
colnames(loan.dat)[cols.2.keep]
## [1] "loan_amnt" "funded_amnt"
## [3] "funded_amnt_inv" "term"
## [5] "int_rate" "installment"
## [7] "grade" "sub_grade"
## [9] "emp_title" "emp_length"
## [11] "home_ownership" "annual_inc"
## [13] "verification_status" "issue_d"
## [15] "loan_status" "pymnt_plan"
## [17] "purpose" "title"
## [19] "zip_code" "addr_state"
## [21] "dti" "delinq_2yrs"
## [23] "earliest_cr_line" "inq_last_6mths"
## [25] "open_acc" "pub_rec"
## [27] "revol_bal" "revol_util"
## [29] "total_acc" "initial_list_status"
## [31] "out_prncp" "out_prncp_inv"
## [33] "total_pymnt" "total_pymnt_inv"
## [35] "total_rec_prncp" "total_rec_int"
## [37] "total_rec_late_fee" "recoveries"
## [39] "collection_recovery_fee" "last_pymnt_d"
## [41] "last_pymnt_amnt" "next_pymnt_d"
## [43] "last_credit_pull_d" "collections_12_mths_ex_med"
## [45] "application_type" "verification_status_joint"
## [47] "acc_now_delinq" "tot_coll_amt"
## [49] "tot_cur_bal" "total_rev_hi_lim"
length(colnames(loan.dat)[cols.2.keep])
## [1] 50
loan.dat <- loan.dat[, cols.2.keep]
We also removed 6 records with missing or zero annual income, since we felt this information is a requirement for obtaining a loan and a covariate we definitely wanted in our final model (and we did not feel we could impute these values properly).
# flag records with missing or zero annual income
query <- is.na(loan.dat$annual_inc) | loan.dat$annual_inc == 0
if (sum(query) > 0){
  loan.dat <- loan.dat[!query, ]
} else stop('unexpected case')
With the remaining set of records and covariates, we decided to examine the pairwise correlation of covariates:
meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
numeric_cols <- meta_loans$variable[meta_loans$type == 'numeric']
# use pairwise complete observations since some covariates still contain NAs
cor.dat <- cor(loan.dat[, numeric_cols], use = "pairwise.complete.obs")
plot_ly(x=colnames(cor.dat),
y=rownames(cor.dat),
z = cor.dat, type = "heatmap", colorscale="Greys")
#ggcorrplot(cor(loan.dat[,numeric_cols]))
#aggr(loan.dat, combined=T, cex.axis=0.6)
We notice from the plot above that there are a few covariates that are highly correlated (which is not unexpected).
We also calculated basic summary statistics of our covariates to help us better understand the data:
summary(loan.dat)
## loan_amnt funded_amnt funded_amnt_inv term
## Min. : 500 Min. : 500 Min. : 0 36 months:621119
## 1st Qu.: 8000 1st Qu.: 8000 1st Qu.: 8000 60 months:266254
## Median :13000 Median :13000 Median :13000
## Mean :14755 Mean :14742 Mean :14703
## 3rd Qu.:20000 3rd Qu.:20000 3rd Qu.:20000
## Max. :35000 Max. :35000 Max. :35000
##
## int_rate installment grade sub_grade
## Min. : 5.32 Min. : 15.67 A:148198 B3 : 56323
## 1st Qu.: 9.99 1st Qu.: 260.71 B:254535 B4 : 55626
## Median :12.99 Median : 382.55 C:245859 C1 : 53387
## Mean :13.25 Mean : 436.72 D:139541 C2 : 52235
## 3rd Qu.:16.20 3rd Qu.: 572.60 E: 70705 C3 : 50161
## Max. :28.99 Max. :1445.46 F: 23046 C4 : 48857
## G: 5489 (Other):570784
## emp_title emp_length home_ownership
## : 51451 10+ years:291569 ANY : 3
## Teacher : 13469 2 years : 78870 MORTGAGE:443555
## Manager : 11240 < 1 year : 70601 NONE : 46
## Registered Nurse: 5525 3 years : 70026 OTHER : 182
## Owner : 5376 1 year : 57095 OWN : 87470
## RN : 5355 5 years : 55704 RENT :356117
## (Other) :794957 (Other) :263508
## annual_inc verification_status issue_d
## Min. : 1200 Not Verified :266744 Oct-2015: 48631
## 1st Qu.: 45000 Source Verified:329558 Jul-2015: 45962
## Median : 65000 Verified :291071 Dec-2015: 44341
## Mean : 75028 Oct-2014: 38782
## 3rd Qu.: 90000 Nov-2015: 37529
## Max. :9500000 Aug-2015: 35886
## (Other) :636242
## loan_status pymnt_plan purpose
## Current :601777 n:887363 debt_consolidation:524214
## Fully Paid :207723 y: 10 credit_card :206181
## Charged Off : 45248 home_improvement : 51829
## Late (31-120 days): 11591 other : 42890
## Issued : 8460 major_purchase : 17277
## In Grace Period : 6253 small_business : 10377
## (Other) : 6321 (Other) : 34605
## title zip_code addr_state
## Debt consolidation :414000 945xx : 9770 CA :129517
## Credit card refinancing:164330 750xx : 9417 NY : 74082
## Home improvement : 40112 112xx : 9272 TX : 71136
## Other : 31892 606xx : 8641 FL : 60935
## Debt Consolidation : 15760 300xx : 8126 IL : 35476
## Major purchase : 12051 100xx : 7605 NJ : 33256
## (Other) :209228 (Other):834542 (Other):482971
## dti delinq_2yrs earliest_cr_line inq_last_6mths
## Min. : 0.00 Min. : 0.0000 Aug-2001: 6659 Min. : 0.0000
## 1st Qu.: 11.91 1st Qu.: 0.0000 Aug-2000: 6529 1st Qu.: 0.0000
## Median : 17.65 Median : 0.0000 Oct-2000: 6322 Median : 0.0000
## Mean : 18.13 Mean : 0.3144 Oct-2001: 6154 Mean : 0.6946
## 3rd Qu.: 23.95 3rd Qu.: 0.0000 Aug-2002: 6086 3rd Qu.: 1.0000
## Max. :1092.52 Max. :39.0000 Sep-2000: 5918 Max. :33.0000
## NA's :25 (Other) :849705 NA's :25
## open_acc pub_rec revol_bal revol_util
## Min. : 0.00 Min. : 0.0000 Min. : 0 Min. : 0.00
## 1st Qu.: 8.00 1st Qu.: 0.0000 1st Qu.: 6443 1st Qu.: 37.70
## Median :11.00 Median : 0.0000 Median : 11875 Median : 56.00
## Mean :11.55 Mean : 0.1953 Mean : 16921 Mean : 55.07
## 3rd Qu.:14.00 3rd Qu.: 0.0000 3rd Qu.: 20829 3rd Qu.: 73.60
## Max. :90.00 Max. :86.0000 Max. :2904836 Max. :892.30
## NA's :25 NA's :25 NA's :498
## total_acc initial_list_status out_prncp out_prncp_inv
## Min. : 1.00 f:456843 Min. : 0 Min. : 0
## 1st Qu.: 17.00 w:430530 1st Qu.: 0 1st Qu.: 0
## Median : 24.00 Median : 6459 Median : 6456
## Mean : 25.27 Mean : 8403 Mean : 8400
## 3rd Qu.: 32.00 3rd Qu.:13659 3rd Qu.:13654
## Max. :169.00 Max. :49373 Max. :49373
## NA's :25
## total_pymnt total_pymnt_inv total_rec_prncp total_rec_int
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 1915 1st Qu.: 1900 1st Qu.: 1201 1st Qu.: 441.5
## Median : 4895 Median : 4862 Median : 3215 Median : 1073.3
## Mean : 7559 Mean : 7521 Mean : 5758 Mean : 1754.8
## 3rd Qu.:10617 3rd Qu.:10566 3rd Qu.: 8000 3rd Qu.: 2238.3
## Max. :57778 Max. :57778 Max. :35000 Max. :24205.6
##
## total_rec_late_fee recoveries collection_recovery_fee
## Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 Median : 0.000
## Mean : 0.3967 Mean : 45.92 Mean : 4.881
## 3rd Qu.: 0.0000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :358.6800 Max. :33520.27 Max. :7002.190
##
## last_pymnt_d last_pymnt_amnt next_pymnt_d last_credit_pull_d
## Jan-2016:470148 Min. : 0.0 Feb-2016:553404 Jan-2016:730572
## Dec-2015:150861 1st Qu.: 280.2 :252971 Dec-2015: 19308
## : 17659 Median : 462.8 Jan-2016: 78195 Nov-2015: 11490
## Oct-2015: 16000 Mean : 2164.2 Mar-2011: 107 Oct-2015: 10419
## Jul-2015: 14483 3rd Qu.: 831.2 Apr-2011: 101 Sep-2015: 10087
## Nov-2015: 13981 Max. :36475.6 Feb-2011: 91 Jul-2015: 8642
## (Other) :204241 (Other) : 2504 (Other) : 96855
## collections_12_mths_ex_med application_type
## Min. : 0.00000 INDIVIDUAL:886864
## 1st Qu.: 0.00000 JOINT : 509
## Median : 0.00000
## Mean : 0.01438
## 3rd Qu.: 0.00000
## Max. :20.00000
## NA's :141
## verification_status_joint acc_now_delinq tot_coll_amt
## :886864 Min. : 0.000000 Min. : 0
## Not Verified : 281 1st Qu.: 0.000000 1st Qu.: 0
## Source Verified: 61 Median : 0.000000 Median : 0
## Verified : 167 Mean : 0.004991 Mean : 226
## 3rd Qu.: 0.000000 3rd Qu.: 0
## Max. :14.000000 Max. :9152545
## NA's :25 NA's :70272
## tot_cur_bal total_rev_hi_lim
## Min. : 0 Min. : 0
## 1st Qu.: 29853 1st Qu.: 13900
## Median : 80559 Median : 23700
## Mean : 139458 Mean : 32069
## 3rd Qu.: 208205 3rd Qu.: 39800
## Max. :8000078 Max. :9999999
## NA's :70272 NA's :70272
After conducting a preliminary data exploration, we had a much better sense of our data set. Based on what we learned, we dropped additional covariates and re-categorized some variables, as explained in this section.
In our prediction model, we decided to predict loan status. We dichotomized loan status into "Good" and "Bad" based on the following criteria:
Good
1. Fully Paid
2. Current
Bad
1. Default
2. Charged Off
3. Late (16-30 days)
4. Late (31-120 days)
Since we are predicting loan status with our model, we dropped records whose loan_status was "Issued":
query <- loan.dat$loan_status != 'Issued'
loan.dat <- loan.dat[query, ]
loan.dat$loan_status_bin <- NA
query = loan.dat$loan_status == 'Fully Paid' | loan.dat$loan_status == 'Current'
loan.dat$loan_status_bin[query] = 'Good'
query = loan.dat$loan_status == 'Charged Off' | loan.dat$loan_status == 'Default' |
loan.dat$loan_status == 'Late (16-30 days)' | loan.dat$loan_status == 'Late (31-120 days)'
loan.dat$loan_status_bin[query] = 'Bad'
query <- !is.na(loan.dat$loan_status_bin)
loan.dat <- loan.dat[query, ]
loan.dat$loan_status_bin = as.factor(loan.dat$loan_status_bin)
summary(loan.dat$loan_status_bin)
## Bad Good
## 60415 809500
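As an aside, the recoding above can be written more compactly with dplyr::case_when; this sketch is equivalent to the step-by-step assignment:

```r
library(tidyverse)  # for dplyr's case_when

loan.dat <- loan.dat %>%
  filter(loan_status != 'Issued') %>%
  mutate(loan_status_bin = case_when(
    loan_status %in% c('Fully Paid', 'Current') ~ 'Good',
    loan_status %in% c('Charged Off', 'Default',
                       'Late (16-30 days)', 'Late (31-120 days)') ~ 'Bad',
    TRUE ~ NA_character_)) %>%     # all other statuses become NA...
  filter(!is.na(loan_status_bin)) %>%  # ...and are dropped here
  mutate(loan_status_bin = as.factor(loan_status_bin))
```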
We also converted funded_amnt_inv to the percentage of the funded amount contributed by investors (perc_funded_amnt_inv) and kept only the year from the issue date (issue_d):
loan.dat <- loan.dat %>%
mutate(perc_funded_amnt_inv = funded_amnt_inv/funded_amnt,
issue_d = as.character(issue_d),
term = as.character(term)) %>%
mutate(year = as.numeric(str_sub(issue_d, start = -4)))
We reclassified 36-month loans as "Short" and 60-month loans as "Long".
query <- loan.dat$term == ' 36 months'
loan.dat$term[query] = 'Short'
loan.dat$term[!query] = 'Long'
loan.dat$term = as.factor(loan.dat$term)
We also dichotomized tot_coll_amt into 0 and greater than 0 (and replaced missing values with zero):
query.na <- is.na(loan.dat$tot_coll_amt)
if (sum(query.na) >0){
loan.dat$tot_coll_amt[query.na] = 0
}
loan.dat <- loan.dat %>%
mutate(tot_coll_amt_gt0 = as.factor(tot_coll_amt > 0.))
We recoded the grade variable to be ordinal:
loan.dat <- loan.dat %>%
  mutate(grade = as.character(grade))
grades <- c('A', 'B', 'C', 'D', 'E', 'F', 'G')
# match() gives each grade's position in the ordered vector (A = 1, ..., G = 7)
loan.dat$grade_ordinal <- as.numeric(match(loan.dat$grade, grades))
sum(is.na(loan.dat$grade_ordinal))
## [1] 0
We also recoded the emp_length variable to be ordinal (and removed records with no employment length information):
loan.dat <- loan.dat %>%
  mutate(emp_length = as.character(emp_length)) %>%
  filter(emp_length != 'n/a')
emp_lengths <- c('< 1 year', '1 year',
                 '2 years', '3 years',
                 '4 years', '5 years',
                 '6 years', '7 years',
                 '8 years', '9 years',
                 '10+ years')
# match() gives each employment length's position ('< 1 year' = 1, ..., '10+ years' = 11)
loan.dat$emp_length_ordinal <- as.numeric(match(loan.dat$emp_length, emp_lengths))
sum(is.na(loan.dat$emp_length_ordinal))
## [1] 0
Based on examining the univariate plots and the correlation plot, we decided to keep the following 25 covariates:
predictors <- c('loan_amnt', 'funded_amnt',
'grade_ordinal',
'emp_length_ordinal', 'home_ownership',
'annual_inc', 'verification_status',
'purpose',
'addr_state', 'dti',
'delinq_2yrs', 'inq_last_6mths',
'open_acc', 'pub_rec',
'revol_bal', 'revol_util',
'total_acc', 'initial_list_status',
'application_type', 'acc_now_delinq',
'tot_coll_amt_gt0', 'tot_cur_bal',
'total_rev_hi_lim', 'perc_funded_amnt_inv',
'term')
predictors
## [1] "loan_amnt" "funded_amnt" "grade_ordinal"
## [4] "emp_length_ordinal" "home_ownership" "annual_inc"
## [7] "verification_status" "purpose" "addr_state"
## [10] "dti" "delinq_2yrs" "inq_last_6mths"
## [13] "open_acc" "pub_rec" "revol_bal"
## [16] "revol_util" "total_acc" "initial_list_status"
## [19] "application_type" "acc_now_delinq" "tot_coll_amt_gt0"
## [22] "tot_cur_bal" "total_rev_hi_lim" "perc_funded_amnt_inv"
## [25] "term"
outcome <- c('loan_status_bin')
loan.dat <- loan.dat[, c(outcome, predictors)]
We also did a simple univariate analysis of the covariates using histograms and boxplots:
meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
numeric_cols <- meta_loans$variable[meta_loans$type == 'numeric']
for (col_name in numeric_cols){
  plt <- loan.dat %>% ggplot(aes(x = .data[[col_name]])) +
    geom_histogram(color = "black") +
    ggtitle(col_name) + labs(x = col_name)
  print(plt)
}
for (col_name in numeric_cols){
  plt <- loan.dat %>% ggplot(aes(x = loan_status_bin, y = .data[[col_name]])) +
    geom_boxplot() +
    ggtitle(col_name) + labs(x = 'Loan Status', y = col_name)
  print(plt)
}
At this point, we still had a few records with missing data. We initially tried imputing these values using the mice package but, for computational reasons, abandoned this idea.
meta_loans <- funModeling::df_status(loan.dat, print_results = FALSE)
meta_loans[order(-meta_loans$p_na),]
# loan.dat = mice(loan.dat, m=1) # let's just impute one dataset
# loan.dat = complete(loan.dat, 1)
Instead, we set each missing value to the mean of that variable computed from the non-missing data:
sum(is.na(loan.dat$loan_status_bin))
## [1] 0
for (j in seq_len(ncol(loan.dat))){
  miss <- is.na(loan.dat[, j])
  # only numeric columns still have missing values; guard against factors anyway
  if (sum(miss) > 0 && is.numeric(loan.dat[, j])){
    loan.dat[miss, j] <- mean(loan.dat[, j], na.rm = TRUE)
  }
}
sum(is.na(loan.dat))
## [1] 0
Here are the final basic summary statistics of the remaining covariates:
## loan_status_bin loan_amnt funded_amnt grade_ordinal
## Bad : 56958 Min. : 500 Min. : 500 Min. :1.000
## Good:769037 1st Qu.: 8375 1st Qu.: 8325 1st Qu.:2.000
## Median :13225 Median :13200 Median :3.000
## Mean :14918 Mean :14904 Mean :2.789
## 3rd Qu.:20000 3rd Qu.:20000 3rd Qu.:4.000
## Max. :35000 Max. :35000 Max. :7.000
##
## emp_length_ordinal home_ownership annual_inc
## Min. : 1.000 ANY : 3 Min. : 3800
## 1st Qu.: 4.000 MORTGAGE:414900 1st Qu.: 47000
## Median : 7.000 NONE : 44 Median : 65000
## Mean : 7.016 OTHER : 141 Mean : 76323
## 3rd Qu.:11.000 OWN : 77211 3rd Qu.: 90000
## Max. :11.000 RENT :333696 Max. :9500000
##
## verification_status purpose addr_state
## Not Verified :256052 debt_consolidation:490336 CA :121460
## Source Verified:315729 credit_card :191638 NY : 68763
## Verified :254214 home_improvement : 47480 TX : 66997
## other : 38973 FL : 55810
## major_purchase : 15973 IL : 33213
## small_business : 9765 NJ : 31263
## (Other) : 31830 (Other):448489
## dti delinq_2yrs inq_last_6mths open_acc
## Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.: 11.87 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 8.00
## Median : 17.57 Median : 0.0000 Median :0.0000 Median :11.00
## Mean : 18.05 Mean : 0.3161 Mean :0.6848 Mean :11.61
## 3rd Qu.: 23.81 3rd Qu.: 0.0000 3rd Qu.:1.0000 3rd Qu.:14.00
## Max. :380.53 Max. :39.0000 Max. :8.0000 Max. :90.00
##
## pub_rec revol_bal revol_util total_acc
## Min. : 0.0000 Min. : 0 Min. : 0.00 Min. : 2.00
## 1st Qu.: 0.0000 1st Qu.: 6544 1st Qu.: 37.90 1st Qu.: 17.00
## Median : 0.0000 Median : 12018 Median : 56.10 Median : 24.00
## Mean : 0.1885 Mean : 17047 Mean : 55.23 Mean : 25.34
## 3rd Qu.: 0.0000 3rd Qu.: 21020 3rd Qu.: 73.70 3rd Qu.: 32.00
## Max. :86.0000 Max. :2904836 Max. :892.30 Max. :169.00
##
## initial_list_status application_type acc_now_delinq
## f:427301 INDIVIDUAL:825607 Min. : 0.000000
## w:398694 JOINT : 388 1st Qu.: 0.000000
## Median : 0.000000
## Mean : 0.005021
## 3rd Qu.: 0.000000
## Max. :14.000000
##
## tot_coll_amt_gt0 tot_cur_bal total_rev_hi_lim perc_funded_amnt_inv
## FALSE:718547 Min. : 0 Min. : 0 Min. :0.0000
## TRUE :107448 1st Qu.: 33375 1st Qu.: 14878 1st Qu.:1.0000
## Median : 103842 Median : 26000 Median :1.0000
## Mean : 142008 Mean : 32252 Mean :0.9972
## 3rd Qu.: 199264 3rd Qu.: 38100 3rd Qu.:1.0000
## Max. :8000078 Max. :9999999 Max. :1.0000
##
## term
## Long :252704
## Short:573291
##
##
##
##
##
We visually examined a few associations that we thought would likely be meaningful. For example, we would want to include state in our model only if it were associated with the loan status outcome: with 51 levels, it would add many parameters to our model and complicate our inference if included unnecessarily.
state.dat <- loan.dat %>%
group_by(addr_state) %>%
summarise(state.prop.default = mean(loan_status_bin == "Bad", na.rm = TRUE),
state.prop.gradeA = sum(grade_ordinal == 1)/n(),
state.prop.gradeB = sum(grade_ordinal == 2)/n(),
state.prop.gradeC = sum(grade_ordinal == 3)/n(),
state.prop.gradeD = sum(grade_ordinal == 4)/n(),
state.prop.gradeE = sum(grade_ordinal == 5)/n(),
state.prop.gradeF = sum(grade_ordinal == 6)/n(),
state.prop.gradeG = sum(grade_ordinal == 7)/n()) %>%
mutate(state = addr_state)
#head(state.dat)
plot_usmap(data = state.dat, regions = "state", values = "state.prop.default") +
scale_fill_continuous(name = "") +
ggtitle("Proportion of Loan Defaults by State") +
theme(legend.position = "right")
We see some variation from state to state, but did not observe regional trends that would let us collapse the states into fewer categories.
We also hypothesized that Debt-to-Income Ratio (DTI) might affect loan repayment, and examined the distribution of this variable.
loan.dat %>%
  filter(!is.na(loan_status_bin)) %>%
  ggplot() +
  geom_boxplot(aes(y = dti, x = loan_status_bin, color = factor(grade_ordinal))) +
  scale_color_brewer(name = "LendingClub Grade", palette = "Blues") +
  theme_dark() +
  scale_y_continuous(name = "Debt-to-Income Ratio") +
  scale_x_discrete(name = "Loan Status") +
  ggtitle("Distributions of DTI by LendingClub Loan Grade and Loan Status")
#excluding 2 outliers
loan.dat %>%
  filter(dti < 7500 & !is.na(loan_status_bin)) %>%
  ggplot() +
  geom_boxplot(aes(y = dti, x = loan_status_bin, color = factor(grade_ordinal))) +
  scale_color_brewer(name = "LendingClub Grade", palette = "Blues") +
  theme_dark() +
  scale_y_continuous(name = "Debt-to-Income Ratio") +
  scale_x_discrete(name = "Loan Status") +
  ggtitle("Distributions of DTI by LendingClub Loan Grade and Loan Status")
#sqrt transform
loan.dat %>%
  filter(!is.na(loan_status_bin)) %>%
  ggplot() +
  geom_boxplot(aes(y = sqrt(dti), x = loan_status_bin, color = factor(grade_ordinal))) +
  scale_color_brewer(name = "LendingClub Grade", palette = "Blues") +
  theme_dark() +
  scale_y_continuous(name = "Square Root of Debt-to-Income Ratio") +
  scale_x_discrete(name = "Loan Status") +
  ggtitle("Distributions of DTI by LendingClub Loan Grade and Loan Status")
#overall distribution
loan.dat %>%
filter(!is.na(loan_status_bin)) %>%
ggplot() +
geom_boxplot(aes(y = sqrt(dti), x = loan_status_bin,color = loan_status_bin)) +
scale_color_brewer(name = "Loan Status",palette = "Blues") +
theme_dark() +
scale_y_continuous(name = "Square Root of Debt-to-Income Ratio") +
scale_x_discrete(name = "Loan Status") +
ggtitle("Distributions of DTI by Loan Status")
In building our prediction model, we considered three different models*:
1. Ridge
2. Lasso
3. Elastic Net
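All three are special cases of the elastic-net penalty that glmnet fits, lambda * ((1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1): alpha = 0 gives ridge, alpha = 1 gives lasso, and 0 < alpha < 1 gives the elastic net. A minimal sketch on simulated data (illustrative only; these are not the LendingClub fits):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)        # 5 toy predictors
y <- factor(rbinom(100, 1, plogis(x[, 1])),
            labels = c('Bad', 'Good'))     # toy binary outcome

fit.ridge.toy <- glmnet(x, y, family = 'binomial', alpha = 0)    # ridge
fit.lasso.toy <- glmnet(x, y, family = 'binomial', alpha = 1)    # lasso
fit.enet.toy  <- glmnet(x, y, family = 'binomial', alpha = 0.5)  # elastic net
```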
Among the three options, we picked our final model based on the best test performance. We were concerned about the large class imbalance in our data set: only about 7% of records had a "Bad" loan status. Therefore, we evaluated our models by AUC, a metric that is less affected by class imbalance. We held out about a fifth of the data (200,000 records) for testing and used the rest for training. We trained our candidate models using repeated 5-fold cross-validation (on the training set) to obtain the optimal tuning parameters and finally evaluated their performance on the test set.
*Warning: Due to the large number of records and predictors, the models took a long time to train (approximately 4 hours).
#setwd(set_dir)
#load('./loan.dat.Rdata')
start_time <- Sys.time()
summary(loan.dat$loan_status_bin)
56958 / (56958 + 769037) # about 6.9% of records have "Bad" status
# This chunk takes a very very long time to run!!!
# I ran this overnight on my laptop and
# saved the results
# That's why eval is set to FALSE
# keep a fifth of the data set to assess test performance
set.seed(1)
test.indx <- sample(1:dim(loan.dat)[1], 200000, replace = FALSE)
train.indx <- setdiff(1:dim(loan.dat)[1], test.indx)
summary(loan.dat$loan_status_bin[test.indx])
summary(loan.dat$loan_status_bin[train.indx])
train.control = trainControl(method="repeatedcv", number=5, repeats=3,
classProbs=T, summaryFunction=twoClassSummary)
models = c("ridge", "lasso", "enet")
n_models = length(models)
AUC = rep(0,n_models)
names(AUC) = models
for (m in 1:n_models) {
print_str <- paste("Training model: ",
models[m], sep='')
print(print_str)
# save our results
if (models[m] == 'ridge'){
fit = train(loan_status_bin ~.,
data=loan.dat[train.indx, ],
method="glmnet", metric="ROC", trControl=train.control,
tuneGrid=expand.grid(alpha = 0, lambda = .5 ^ (-20:20)))
fit.ridge <- fit
} else if (models[m] == 'lasso'){
fit = train(loan_status_bin ~.,
data=loan.dat[train.indx, ],
method="glmnet", metric="ROC", trControl=train.control,
tuneGrid=expand.grid(alpha = 1, lambda = .5 ^ (-20:20)))
fit.lasso <- fit
} else if (models[m] == 'enet'){
fit = train(loan_status_bin ~.,
data=loan.dat[train.indx, ],
method="glmnet", metric="ROC", trControl=train.control,
tuneGrid=expand.grid(alpha = seq(.05,.95,.05), lambda = .5 ^ (-20:20)))
fit.enet <- fit
} else if (models[m] == 'adaboost'){
fit = train(loan_status_bin ~.,
data=loan.dat[train.indx, ],
method=models[m], metric="ROC", trControl=train.control)
fit.adaboost <- fit
} else if (models[m] == 'xgbTree'){
fit = train(loan_status_bin ~.,
data=loan.dat[train.indx, ],
method=models[m], metric="ROC", trControl=train.control)
fit.xgb <- fit
} else if (models[m] == 'C5.0Cost'){
fit = train(loan_status_bin ~.,
data=loan.dat[train.indx, ],
method=models[m], metric="ROC", trControl=train.control)
fit.ccost <- fit
} else if (models[m] == 'svmLinearWeights'){
fit = train(loan_status_bin ~.,
data=loan.dat[train.indx, ],
method=models[m], metric="ROC", trControl=train.control)
fit.svm <- fit
} else stop('Unknown model type!')
probs = predict(fit, loan.dat[test.indx,], type="prob")
R = roc(loan.dat$loan_status_bin[test.indx], probs$Good)
plot.roc(R, add=(m>1), col=m, lwd=2, main="ROC curves")
legend("bottomright", legend=models, col=1:n_models, lwd=2)
AUC[m] = R$auc
}
AUC
end_time <- Sys.time()
time <- end_time - start_time
time
setwd(set_dir)
save(fit.ridge, fit.lasso, fit.enet, file = "./models.RData")
## [1] "ridge"
## [1] "lasso"
## [1] "enet"
## ridge lasso enet
## 0.7073157 0.7077265 0.7077178
The AUCs of the three models were similar. However, the lasso model (\(\alpha=1\) and \(\lambda=0.0001220703\)) had the largest AUC, so we selected it as our final model and retrained it with these tuning parameters on all of the data.
x = model.matrix(loan_status_bin ~ ., loan.dat)[,-1]
y = loan.dat$loan_status_bin
final_mod = glmnet(x=x, y=y, family = 'binomial', alpha = 1,
lambda = 0.0001220703)
setwd(set_dir)
save(final_mod, file = "./final_model.RData")
Here is our final lasso model (\(\alpha=1\) and \(\lambda=0.0001220703\)) with its regression coefficients. Coefficients shown as "." were shrunk to exactly zero by the lasso penalty.
#setwd(set_dir)
#load("final_model.RData")
coef(final_mod)
## 92 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 2.126283e+00
## loan_amnt -6.750296e-06
## funded_amnt .
## grade_ordinal -3.122226e-01
## emp_length_ordinal 7.644084e-03
## home_ownershipMORTGAGE 2.317965e-03
## home_ownershipNONE -3.335335e-01
## home_ownershipOTHER -5.775394e-01
## home_ownershipOWN .
## home_ownershipRENT -1.271860e-01
## annual_inc 4.555097e-06
## verification_statusSource Verified 7.753753e-02
## verification_statusVerified -1.087390e-01
## purposecredit_card 1.577752e-01
## purposedebt_consolidation .
## purposeeducational -4.108364e-01
## purposehome_improvement .
## purposehouse -2.484775e-02
## purposemajor_purchase .
## purposemedical -1.028531e-01
## purposemoving -2.943002e-02
## purposeother -3.470342e-02
## purposerenewable_energy -1.101536e-01
## purposesmall_business -5.684232e-01
## purposevacation 2.839905e-02
## purposewedding -1.683520e-01
## addr_stateAL -1.177174e-01
## addr_stateAR .
## addr_stateAZ -3.691371e-02
## addr_stateCA -9.251343e-02
## addr_stateCO 1.610975e-01
## addr_stateCT 8.365280e-02
## addr_stateDC 3.818644e-01
## addr_stateDE .
## addr_stateFL -1.531459e-01
## addr_stateGA 1.720272e-02
## addr_stateHI -1.190482e-01
## addr_stateIA -8.190531e-02
## addr_stateID .
## addr_stateIL 1.824187e-01
## addr_stateIN 3.678528e-02
## addr_stateKS 1.539369e-01
## addr_stateKY 1.526397e-02
## addr_stateLA -8.431524e-02
## addr_stateMA -4.081359e-02
## addr_stateMD -8.468537e-02
## addr_stateME 2.471315e+00
## addr_stateMI .
## addr_stateMN -4.660058e-03
## addr_stateMO .
## addr_stateMS 3.633531e-01
## addr_stateMT 9.628141e-02
## addr_stateNC -7.785504e-02
## addr_stateND 1.600109e+00
## addr_stateNE 1.633703e+00
## addr_stateNH 2.640157e-01
## addr_stateNJ -1.183487e-01
## addr_stateNM -8.512447e-02
## addr_stateNV -2.545912e-01
## addr_stateNY -1.260648e-01
## addr_stateOH 4.187263e-03
## addr_stateOK -4.815880e-02
## addr_stateOR 3.112655e-02
## addr_statePA 2.136455e-03
## addr_stateRI -3.935880e-04
## addr_stateSC 1.443052e-01
## addr_stateSD -1.651235e-03
## addr_stateTN 2.167357e-02
## addr_stateTX 7.209071e-02
## addr_stateUT -1.004820e-01
## addr_stateVA -1.299379e-01
## addr_stateVT 2.788129e-01
## addr_stateWA 4.355436e-02
## addr_stateWI 1.073420e-01
## addr_stateWV 1.579249e-01
## addr_stateWY 2.204878e-01
## dti 8.014140e-04
## delinq_2yrs 1.155930e-02
## inq_last_6mths -1.828347e-01
## open_acc 3.228885e-03
## pub_rec 1.617830e-01
## revol_bal 2.759894e-06
## revol_util -4.939832e-03
## total_acc -1.947432e-03
## initial_list_statusw 5.882670e-01
## application_typeJOINT 2.047160e+00
## acc_now_delinq .
## tot_coll_amt_gt0TRUE 1.816111e-01
## tot_cur_bal 1.381133e-07
## total_rev_hi_lim .
## perc_funded_amnt_inv 1.315427e+00
## termShort 3.201315e-03
We also ran a series of simple logistic regressions to explore factors associated with defaulting on a loan. We recoded loan status into a dummy variable, with 1 indicating "Bad". The predictors are:
#recode loan status to a dummy variable, grade and employment length to continuous variables
loan.dat <- loan.dat %>%
mutate(status_dum = ifelse(loan_status_bin == "Good",0,1))
#prepare data subset
varlist <- c('status_dum','funded_amnt','grade_ordinal',
'emp_length_ordinal', 'annual_inc','dti',
'inq_last_6mths','open_acc',
'revol_bal','revol_util','total_acc',
'tot_cur_bal','total_rev_hi_lim')
loan.dat_sub <- loan.dat[, varlist]
#run simple logistic regressions
glm_fun <- function(x) {
fit <- glm(status_dum ~ x, data = loan.dat_sub, family = "binomial")
out <- coef(summary(fit))
return(out)
}
simple_result <- lapply(loan.dat_sub[,2:13],function(x) glm_fun(x))
Here are the results for each logistic regression. Note that the coefficients are on the log-odds scale.
simple_result_df <- matrix(nrow = 12, ncol = 4)
colnames(simple_result_df) <- c("Estimate","SE","z value","p-value")
rownames(simple_result_df) <- names(simple_result)
for (i in 1:12){
simple_result_df[i,] <- simple_result[[i]][2,]
}
tidy(simple_result_df)
While all of the estimates are statistically significant at the 5% significance level, the effect size is negligible for most of these variables. We present the odds ratio (OR) and 95% confidence interval for the three variables with the largest effect sizes: grade, employment length, and inquiries in the last 6 months.
#this function exponentiates the coefficients and computes the confidence interval for each estimate
glm_CI <- function(x) {
fit <- glm(status_dum ~ x, data = loan.dat_sub, family = "binomial")
out <- exp(cbind(OR = coef(fit), confint(fit)))
return(out)}
Grade
This result suggests that each one-unit increase in grade, which represents a downgrade of one level (e.g., from A to B), increases the odds of defaulting on a loan by about 49%.
glm_CI(loan.dat_sub$grade_ordinal)
## OR 2.5 % 97.5 %
## (Intercept) 0.02140505 0.02093053 0.02188912
## x 1.49241807 1.48341644 1.50147486
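As a rough sanity check on this interpretation, the rounded estimates above imply the following predicted default probabilities by grade (a back-of-the-envelope sketch, not a substitute for the fitted model):

```r
or_intercept <- 0.0214   # baseline odds, from the intercept row above
or_grade     <- 1.4924   # multiplicative change in odds per grade level
odds  <- or_intercept * or_grade ^ (1:7)   # odds of "Bad" for grades A through G
probs <- odds / (1 + odds)                 # convert odds to probabilities
round(setNames(probs, c('A', 'B', 'C', 'D', 'E', 'F', 'G')), 3)
```

The implied probability rises from about 3% for grade A to roughly 26% for grade G, consistent with the direction of the fitted odds ratio.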
Employment length
For each one-year increase in employment length (up to 10 years), the odds of defaulting on a loan decrease by about 2%.
glm_CI(loan.dat_sub$emp_length_ordinal)
## OR 2.5 % 97.5 %
## (Intercept) 0.08315479 0.08167605 0.08465545
## x 0.98340649 0.98113713 0.98568193
Inquiries in the last 6 months
For each one-unit increase in the number of inquiries in the past 6 months (excluding auto and mortgage inquiries), the odds of defaulting on a loan increase by about 29%. A recent inquiry might suggest insufficient funds to make an upcoming repayment.
glm_CI(loan.dat_sub$inq_last_6mths)
## OR 2.5 % 97.5 %
## (Intercept) 0.06025912 0.0596011 0.06092263
## x 1.29462778 1.2849837 1.30431849
We note that grade is a function of many other factors (e.g., employment length, annual income). Therefore, we also ran a multivariable logistic regression to adjust for grade.
We see that after adjusting for grade, the effects of employment length and inquiries in the last 6 months on loan repayment remain statistically significant, although the odds ratio for inquiries in the last 6 months decreases noticeably: within each grade stratum, the effect of inquiries in the last 6 months on loan default is smaller.
reg_fit <- glm(status_dum ~ grade_ordinal + emp_length_ordinal + inq_last_6mths, data=loan.dat, family = "binomial")
exp(cbind(OR = coef(reg_fit), confint(reg_fit)))
## OR 2.5 % 97.5 %
## (Intercept) 0.02310968 0.02248185 0.02375353
## grade_ordinal 1.45616035 1.44715249 1.46522405
## emp_length_ordinal 0.98207577 0.97977778 0.98437994
## inq_last_6mths 1.16910764 1.15996498 1.17829786